Regularized and Smooth Double Core Tensor Factorization for Heterogeneous Data
We introduce a general tensor model suitable for data analytic tasks on
heterogeneous data sets, wherein there are joint low-rank structures within
groups of observations, but also discriminative structures across different
groups. To capture such complex structures, a double core tensor (DCOT)
factorization model is introduced together with a family of smoothing loss
functions. By leveraging the proposed smoothing functions, the model
accurately estimates its factors even in the presence of missing entries. A
linearized ADMM method is employed to solve regularized versions of the DCOT
factorization; it avoids large tensor operations and large memory storage
requirements. Further, we theoretically establish its global convergence,
together with consistency of the estimates of the model parameters. The
effectiveness of the DCOT model is illustrated on several real-world examples,
including image completion, recommender systems, subspace clustering, and
module detection in heterogeneous multi-modal omics data, where it provides
more insightful decompositions than conventional tensor methods.
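
As a rough single-machine illustration (a minimal sketch based only on this
abstract, not the authors' implementation), the snippet below fits a
third-order tensor with a shared core plus group-specific cores under a Huber
smoothing loss on the observed entries, using plain gradient steps in place of
the linearized ADMM; the group split, the Huber choice, and all variable names
are assumptions.

    # Hypothetical DCOT-style sketch: shared core C (joint structure) plus a
    # per-group core G[g] (discriminative structure), fitted on observed
    # entries with a Huber smoothing loss via plain gradient descent.
    import numpy as np

    def tucker(core, U1, U2, U3):
        # Tucker-style reconstruction: core x_1 U1 x_2 U2 x_3 U3
        return np.einsum('abc,ia,jb,kc->ijk', core, U1, U2, U3)

    def huber_grad(r, delta=1.0):
        # elementwise derivative of the Huber smoothing loss
        return np.clip(r, -delta, delta)

    rng = np.random.default_rng(0)
    I, J, K, r = 12, 10, 8, 3
    X = rng.standard_normal((I, J, K))            # heterogeneous data tensor
    mask = rng.random(X.shape) < 0.7              # observed-entry indicator
    groups = [np.arange(0, 6), np.arange(6, 12)]  # two groups along mode 1
    U1 = 0.1 * rng.standard_normal((I, r))        # factor updates omitted
    U2 = 0.1 * rng.standard_normal((J, r))
    U3 = 0.1 * rng.standard_normal((K, r))
    C = np.zeros((r, r, r))                       # shared ("joint") core
    G = [np.zeros((r, r, r)) for _ in groups]     # group-specific cores

    for _ in range(200):
        for g, idx in enumerate(groups):
            Xhat = tucker(C + G[g], U1[idx], U2, U3)
            R = mask[idx] * huber_grad(Xhat - X[idx])
            # gradient of the smoothed loss w.r.t. the combined core
            grad = np.einsum('ijk,ia,jb,kc->abc', R, U1[idx], U2, U3)
            C -= 0.1 * grad
            G[g] -= 0.1 * grad  # the paper regularizes these; omitted here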
Margin Maximization in Attention Mechanism
The attention mechanism is a central component of the transformer architecture,
which led to the phenomenal success of large language models. However, the
theoretical principles underlying the attention mechanism are poorly
understood, especially its nonconvex optimization dynamics. In this work, we
explore the seminal softmax-attention model
$f(X) = \langle Xv, \mathrm{softmax}(XWp) \rangle$, where $X$ is the token
sequence and $(v, W, p)$ are tunable parameters. We prove that running
gradient descent on $p$, or equivalently on $W$, converges in direction to a
max-margin solution that separates locally optimal tokens from non-optimal
ones. This clearly formalizes attention as a token separation mechanism.
Remarkably, our results are applicable to general data and precisely
characterize the optimality of tokens in terms of the value embeddings $Xv$
and the problem geometry. We also provide a broader regularization path
analysis that establishes the margin-maximizing nature of attention even for
nonlinear prediction heads. When optimizing $v$ and $p$ simultaneously with
logistic loss, we identify conditions under which the regularization paths
directionally converge to their respective hard-margin SVM solutions, where
$v$ separates the input features based on their labels. Interestingly, the SVM
formulation of $p$ is influenced by the support vector geometry of $v$.
Finally, we verify our theoretical findings via numerical experiments and
provide insights.
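
A quick toy check of this claim is easy to run (assumptions mine: a single
sequence, fixed $v$ and $W$, and gradient ascent on $f$ itself rather than on
a loss): the attention weights should concentrate on the token with the
largest value score, while the direction $p/\|p\|$ stabilizes.

    # Toy check: for f(X) = <Xv, softmax(XWp)>, ascending f in p should make
    # the softmax select the token with the largest value score Xv, with p
    # growing in norm but converging in direction (margin maximization).
    import numpy as np

    rng = np.random.default_rng(1)
    T, d = 6, 4
    X = rng.standard_normal((T, d))   # token sequence (rows are tokens)
    v = rng.standard_normal(d)        # fixed value head
    W = np.eye(d)                     # fixed key-query matrix
    p = np.zeros(d)                   # trainable attention parameter

    def attn(p):
        logits = X @ W @ p
        e = np.exp(logits - logits.max())
        return e / e.sum()

    scores = X @ v                    # value score of each token
    for _ in range(5000):
        a = attn(p)
        f = a @ scores
        # softmax chain rule: df/dp = (XW)^T [a * (scores - f)]
        p += 0.1 * (X @ W).T @ (a * (scores - f))

    print('attention weights:', np.round(attn(p), 3))
    print('direction p/|p| :', np.round(p / np.linalg.norm(p), 3))
    print('top-score token :', int(np.argmax(scores)))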
A Penalty Based Method for Communication-Efficient Decentralized Bilevel Programming
Bilevel programming has recently received attention in the literature due to
its wide range of applications, including reinforcement learning and
hyper-parameter optimization. However, it is widely assumed that the underlying
bilevel optimization problem is solved either by a single machine or by
multiple machines connected in a star-shaped network, i.e., the federated
learning setting. The latter approach suffers from a high communication cost at
the central node (e.g., the parameter server) and exhibits privacy vulnerabilities.
Hence, it is of interest to develop methods that solve bilevel optimization
problems in a communication-efficient decentralized manner. To that end, this
paper introduces a penalty-function-based decentralized algorithm with
theoretical guarantees for this class of optimization problems. Specifically, a
distributed alternating gradient-type algorithm for solving consensus bilevel
programming over a decentralized network is developed. A key feature of the
proposed algorithm is that it estimates the hyper-gradient of the penalty
function via decentralized computation of matrix-vector products and a few
vector communications; this estimate is then integrated into the alternating
algorithm, yielding a finite-time convergence analysis under different
convexity assumptions. Owing to the generality of this complexity analysis, the
result yields convergence rates for a wide variety of consensus problems,
including minimax and compositional optimization. Empirical results on both
synthetic and real datasets demonstrate that the proposed method works well in
practice.
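
The matrix-vector-product idea can be sketched on a toy quadratic instance
(single machine; the penalty and consensus machinery are omitted, and all
symbols below are illustrative). The inverse-Hessian-vector product inside the
hyper-gradient is computed by conjugate gradient, which touches the lower-level
Hessian only through products of the form $Hv$, precisely the operation a
network can realize with local computation and a few vector communications.

    # Sketch of hyper-gradient estimation with matrix-vector products only.
    # For min_x f(x, y*(x)) with y*(x) = argmin_y g(x, y), the hyper-gradient
    # is grad_x f - G_xy @ inv(G_yy) @ grad_y f; inv(G_yy) @ grad_y f is
    # obtained by conjugate gradient using Hessian-vector products alone.
    import numpy as np

    rng = np.random.default_rng(2)
    n = 5
    A = rng.standard_normal((n, n))
    G_yy = A @ A.T + n * np.eye(n)        # lower-level Hessian (SPD)
    G_xy = rng.standard_normal((n, n))    # cross second derivative
    grad_x_f = rng.standard_normal(n)
    grad_y_f = rng.standard_normal(n)

    def cg(hvp, b, iters=50, tol=1e-10):
        # solve H z = b given only the map hvp(v) = H v
        z = np.zeros_like(b)
        r = b - hvp(z)
        d = r.copy()
        for _ in range(iters):
            Hd = hvp(d)
            alpha = (r @ r) / (d @ Hd)
            z = z + alpha * d
            r_new = r - alpha * Hd
            if np.linalg.norm(r_new) < tol:
                break
            d = r_new + ((r_new @ r_new) / (r @ r)) * d
            r = r_new
        return z

    z = cg(lambda u: G_yy @ u, grad_y_f)
    hypergrad = grad_x_f - G_xy @ z
    print('hyper-gradient estimate:', np.round(hypergrad, 4))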